Since the coronavirus disease 2019 (COVID-19) pandemic hit the world, it
has had a significant negative impact on individuals, governments, and the
global economy. One way to reduce the negative impact of COVID-19 is
vaccination. Briefly, vaccination aims to train the immune system to
remember the characteristics of the targeted viral pathogen so that it can
initiate an immune response that is rapid and strong enough to defeat future
live viral pathogens. However, there are still many people in the world who
are anti-vaccine, which greatly hampers the effort to build immunity widely
in the community. Anti-vaccine people can be found on various social media
platforms. Twitter was chosen as the data source because Twitter is a common
source of text for sentiment analysis. This study aims to analyze public
sentiment on the COVID-19 vaccine through Twitter in the form of tweets and
retweets. This study uses the Gaussian Naive Bayes method to classify
sentiment. The experimental results show that the Gaussian Naive Bayes method
achieves an average accuracy of 97.48% across the vaccine datasets used.
1. INTRODUCTION

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by the novel severe acute
respiratory syndrome coronavirus 2 (SARS-CoV-2), first identified in Wuhan, China, in December 2019 [1].
At the beginning of August 2021, the cumulative number of confirmed positive cases worldwide was
200,702,075, while the death toll was 263,985 [2]. The COVID-19 virus can be transmitted by close contact or
even by droplets between individuals [3]. Since the COVID-19 pandemic hit the world in 2020, it has had a
significant negative impact on individuals, governments, and the global economy [4]. Countries around the
world are now racing to reduce this negative impact. One way to reduce the negative impact of COVID-19 is
to get vaccinated. Many COVID-19 vaccines are being distributed around the world under different brands,
such as Sinovac, Moderna, Sinopharm, and Pfizer. As of August 2021, global vaccine data show that
1,172,440,018 (15%) doses of vaccine have been administered [5]. In summary, vaccination aims to train the
immune system to remember the characteristics of the targeted viral pathogen and to initiate an immune
response that is fast and strong enough to defeat that pathogen in the future [6].

However, there are still many people in the world who are anti-vaccine. This greatly hampers the
effort to build immunity widely in the community. Anti-vaccine people can be found on various social media
platforms, including Twitter. Twitter is one of the social media platforms, with 187 million daily active users
in the third quarter of 2020 [7]. Twitter was chosen as the data source because it is a common source of text
for sentiment analysis, including sentiment analysis on vaccination [8]-[10]. Sentiment analysis has often
been carried out in related studies, for example on COVID-19, image sentiment, airline reviews, political
sentiment, Facebook comments, hotel reviews, PC reviews, customer satisfaction/reviews, e-sport training,
film reviews, and polygamy. Sentiment analysis can also be used for product and service evaluation [11]-[20].

Several previous studies have discussed similar problems. One study attempts to compare
classification in the sentiment analysis of Telkom products, based on customer reviews written as tweets on
Twitter, using k-nearest neighbor (KNN), Naive Bayes, and TextBlob models [21]. Another study focuses on
assessing Indonesian perceptions through sentiment analysis in order to determine people's perceptions of
the issue of polygamy [22]. Another paper presents an ensemble-based model for facial expression
recognition (FER) that combines several classification models for sentiment analysis of images [23]. Another
study discusses the analysis of English comments on the Facebook platform using the Naive Bayes
method [24]. The raw data used in this technique are tweets taken from Twitter concerning the COVID-19
vaccine, Pfizer, Moderna, and AstraZeneca. Another study assesses Indonesian public opinion through an
analysis of the COVID-19 vaccine social network in January 2021 [25].

Another study performs sentiment analysis using the Naive Bayes algorithm on Twitter data crawled
with the keyword 'COVID-19 vaccine' [26]. Another study collected data on Filipino sentiments regarding
the Philippine government's efforts against COVID-19 using the social networking site Twitter. Natural
language processing techniques were applied to understand common sentiments, which can assist
governments in analyzing responses. Sentiments were annotated, and a Naive Bayes model was trained to
classify English and Filipino tweets [26]. Another study analyzes the sentiments of people living in India
concerning the COVID-19 vaccine, noting that the COVID-19 pandemic has also coincided with social media
companies experiencing an increase in traffic [27]. Other research aims to use machine learning methods to
extract topics and sentiments related to COVID-19 vaccination on Twitter using the latent Dirichlet
allocation (LDA) method [28]. Building on these previous studies, this study aims to analyze public
sentiment on the COVID-19 vaccines (AstraZeneca, Moderna, Pfizer, Sinovac, and Sinopharm) through
Twitter, in the form of tweets and retweets collected using keywords. The goal is to determine which type of
vaccine has the most positive and negative sentiments, along with the classification accuracy of the
Gaussian Naive Bayes model used.


2. METHOD 

This study uses a dataset collected from Twitter and applies several data preprocessing techniques,
feature extraction, and classification. Figure 1 shows the system design used in this study. Each stage is
explained in more detail in the following subsections.


Figure 1. Design system (recoverable stage labels: Preprocessing, Cleaning, Feature Extraction)


2.1. Twitter API and dataset 
The first step is to obtain access to the Twitter application programming interface (API). Once access
is obtained, Twitter data are collected based on the entered keywords (#Vaccine Aztrazeneca,
#Vaccine Moderna, #Vaccine Pfizer, #Vaccine Sinopharm, and #Vaccine Sinovac). The dataset was collected
from 7 August 2021 to 7 October 2021, and the total number of tweets for each vaccine is shown in Table 1.
The tweets used are from people all over the world.


Table 1. Dataset distribution

COVID-19 vaccine    Total tweets
AstraZeneca         2789
Moderna             2757
Pfizer              2688
Sinopharm           2749
Sinovac             2693


Table 1 shows that the total number of tweets differs for each type of vaccine, even though the
data were collected over the same period. There are several possible reasons for this: on certain days few
users tweet about a given vaccine type, on certain days no news about a given vaccine is published, or the
same tweet is posted repeatedly and is therefore counted only once.
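The "counted only once" behaviour for repeated tweets can be illustrated with a minimal Python sketch. The source does not say how duplicates were detected, so the normalization used here (lowercasing and collapsing whitespace) and the function name `deduplicate` are assumptions made purely for illustration.

```python
def deduplicate(tweets):
    """Keep the first occurrence of each tweet text.

    Assumption: two tweets are "the same" if they match after
    lowercasing and collapsing whitespace. The paper does not
    specify the actual rule.
    """
    seen = set()
    unique = []
    for text in tweets:
        key = " ".join(text.lower().split())  # normalized comparison key
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


raw = [
    "Time for boosters",
    "time for   boosters",  # repeat of the first tweet
    "Two doses effective",
]
print(len(deduplicate(raw)))
```

With a rule like this, identical content posted many times contributes a single row, which is one way the per-vaccine totals in Table 1 can diverge.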


2.2. Pre-processing 

This process first removes special characters, emoticons, and symbols from the tweet data. Case
folding is then applied, which converts the sentences in the dataset to lowercase and keeps only the letters
a-z. Next, tokenization is carried out, which splits each sentence into individual words, and the process ends
with stemming, which reduces affixed words to their base forms. The result of pre-processing is a word
dictionary along with other features that will be used in the next stage.
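The pre-processing steps described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the source does not name the cleaning rules or the stemmer, so the regular expressions are assumptions and `toy_stem` is a deliberately crude, hypothetical stand-in for a real stemmer such as Porter's.

```python
import re


def clean(text):
    """Cleaning and case folding: keep only lowercase letters a-z."""
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # drop links and @mentions (assumed rule)
    text = text.lower()                             # case folding
    text = re.sub(r"[^a-z\s]", " ", text)           # remove digits, symbols, emoticons
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace


def tokenize(text):
    """Tokenization: split the cleaned sentence into words."""
    return text.split()


def toy_stem(token):
    """Hypothetical placeholder stemmer that strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Run cleaning, case folding, tokenization, and stemming in order."""
    return [toy_stem(t) for t in tokenize(clean(tweet))]


print(preprocess("Clinics are charging RM 350 for Sinovac! https://t.co/x @user"))
# ['clinic', 'are', 'charg', 'rm', 'for', 'sinovac']
```

The ordering matters: cleaning and case folding run before tokenization so that the tokens passed to the stemmer contain only the letters a-z.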


2.3. Feature extraction 

This process performs feature extraction, and its output is later used for classification. The
resulting features are the number of words in a tweet, polarity, and subjectivity. Polarity and subjectivity are
computed from the words counted in each tweet. These two features (polarity and subjectivity) then
determine the limit values used in the classification process. The result of this process is a set of polarity and
subjectivity limit values that determine whether a tweet falls into the positive, negative, or neutral sentiment
category, which is classified later.
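A minimal Python sketch of this step is given below, under several stated assumptions. The source does not specify the lexicon, the scoring formula, or the limit values, so the tiny `LEXICON`, the averaging rule, and the zero threshold on polarity are all hypothetical. For simplicity the sketch labels tweets from polarity alone and keeps subjectivity only as a feature, which may differ from the procedure used in the study.

```python
# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "effective": (0.6, 0.8),
    "good": (0.7, 0.6),
    "undetectable": (-0.4, 0.5),
    "bad": (-0.7, 0.67),
}


def extract_features(tokens):
    """Return (word_count, polarity, subjectivity) for one tokenized tweet."""
    scored = [LEXICON[t] for t in tokens if t in LEXICON]
    word_count = len(tokens)
    if not scored:
        return word_count, 0.0, 0.0  # no sentiment-bearing words found
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return word_count, polarity, subjectivity


def label(polarity, threshold=0.0):
    """Map polarity to a sentiment category using an assumed limit value."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


count, pol, subj = extract_features(["two", "doses", "effective"])
print(count, pol, subj, label(pol))
```

The three numeric features per tweet are what a Gaussian classifier would consume in the next stage, with the category produced by `label` serving as the class to be predicted.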


2.4. Gaussian Naive Bayes

In this process, classification is carried out using a Gaussian Naive Bayes machine learning
model. Gaussian Naive Bayes is a variant of Naive Bayes in which the likelihood is calculated using a normal
distribution. The Naive Bayes method has recently been widely used in classification tasks, especially on
social media networks such as Twitter, using several methods including Unigram Naive Bayes, Multinomial
Naive Bayes, and maximum entropy classification [29].

Gaussian Naive Bayes allows classifying numerical data with a Gaussian distribution as well as
categorical data [30]. Gaussian Naive Bayes is the simplest to work with because it only needs to estimate
the mean and standard deviation from the training data [30]. Calculating Gaussian Naive Bayes can be done
with (1):


$$P(C \mid Z) = \frac{P(Z \mid C)\, P(C)}{P(Z)} \tag{1}$$

where C is the class label and Z is the applied attribute. P(Z|C) is the probability of the attribute given the
class, P(C) is the prior probability of the class label, and P(Z) is the probability of the applied attribute. In
this classification, the processed data are classified into three classes, namely positive, negative, and
neutral.


$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{2}$$

N is the number of samples and x_i is the value of each input variable in the training data.


$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \tag{3}$$

N is the number of samples, x_i is the i-th sample, and \mu is the average value.
When making predictions, these parameters can be plugged into the Gaussian probability density
function (4) together with a new input value for the variable.

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{4}$$

f(x) is the Gaussian probability density function. In (4), \mu and \sigma are the mean and standard deviation
calculated from the training data, \pi is a numeric constant, and x is the input value for the input variable.
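To make the roles of (1)-(4) concrete, the following is a minimal from-scratch Python sketch of a Gaussian Naive Bayes classifier. It is not the authors' implementation. The function names, the toy feature rows (polarity, subjectivity), the use of log-probabilities, and the small constants that guard against a zero standard deviation or a zero density are all assumptions added for illustration. P(Z) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import defaultdict


def fit(X, y):
    """Estimate the prior P(C) and the per-feature mean (2) and std (3) for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        stds = [
            math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1e-9  # guard against zero std
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, stds)
    return model


def gaussian_pdf(x, mean, std):
    """Gaussian probability density function, as in (4)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)


def predict(model, row):
    """Pick the class that maximizes log P(C) + sum of log f(x_j | C), following (1)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, stds) in model.items():
        score = math.log(prior)
        for x, m, s in zip(row, means, stds):
            score += math.log(gaussian_pdf(x, m, s) + 1e-300)  # avoid log(0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training rows: (polarity, subjectivity). Values are invented for illustration.
X = [[0.5, 0.6], [0.4, 0.7], [-0.5, 0.6], [-0.4, 0.5], [0.0, 0.1], [0.05, 0.0]]
y = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
model = fit(X, y)
print(predict(model, [0.45, 0.65]))
```

In practice a library implementation such as scikit-learn's `GaussianNB` would normally be preferred. The sketch is only meant to show that training reduces to computing a prior, a mean, and a standard deviation per class and feature.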


3. RESULTS AND DISCUSSION 

This section discusses the experimental results of this research. The number of tweets in each
sentiment category can be seen in Table 2. The AstraZeneca vaccine had the highest number of positive
sentiments, while the Sinovac vaccine had the lowest number of positive sentiments. The Moderna vaccine
had the highest number of negative sentiments. Looking at the numbers of positive, neutral, and negative
sentiments together, AstraZeneca is the vaccine with the most favorable sentiment.

Examples of the sentiment category assigned to individual tweets can be seen in Table 3. Figures 2-6
are examples of the word count representation of the resulting sentiment categories (positive and negative).
Word count serves to represent a sentence or document as a value that is used for classification.


Table 2. Number of sentiment categories

COVID-19 vaccine    Positive    Neutral    Negative
AstraZeneca         1177        942        670
Moderna             1098        904        755
Pfizer              1065        927        696
Sinopharm           1133        1185       431
Sinovac             1034        1133       526


Table 3. Sentiment category result

Username | Tweet | Sentiment category
@pakhead | New Turkish study claims that 3 doses of sinovac is more effective then 2 sino + 1 mrna. | Positive
@wcchen | New England Journal of Medicine: Two doses of Pfizer, AstraZeneca vaccines effective against COVID Delta variant. | Positive
@MOH_TT | What about those who need to travel to the US or Canada and they got the Sinopharm vaccine which isn't accepted by the US or Canada? | Neutral
@teddyboylocsin | Time for boosters, Sinovac antibodies are undetectable 6 months after inoculation. | Neutral
@Reuters_Health | The U.S. Food and Drug Administration is expected to authorize a third booster dose of COVID-19 vaccines by Pfizer | Negative
C esshimiseli | Clinics are charging RM 350 for sinovac two doses. Singapore clinics only charge SGD 20-50. Where does the money | Negative
Figure 2. Moderna vaccine negative category word count
Figure 3. Pfizer vaccine negative category word count
Figure 4. Sinovac vaccine negative category word count
Figure 5. AstraZeneca vaccine positive category word count


The average polarity and subjectivity of each type of vaccine in this study can be seen in Table 4.
Based on the results in Table 4, each type of vaccine is then classified by its resulting sentiment in order to
evaluate the accuracy. A comparison of the sentiment classification accuracy obtained using the Gaussian
Naive Bayes model and logistic regression can be seen in Table 5.
Figure 6. Sinopharm vaccine positive category word count


Table 4. Polarity and subjectivity result

COVID-19 vaccine    Polarity       Subjectivity
AstraZeneca         0.089626687    0.321059181
Moderna             0.070549392    0.311492526
Pfizer              0.076261211    0.302357763
Sinopharm           0.118422537    0.271337923
Sinovac             0.082584714    0.271676892


Table 5. Accuracy result (accuracy, %)

Method                  AstraZeneca    Moderna    Pfizer    Sinopharm    Sinovac
Gaussian Naive Bayes    98.9           97.8       97.6      97.5         95.6
Logistic Regression     91.6           95.8       94.2      94           91.9


Table 5 shows that the proposed method outperforms logistic regression on all of the vaccine
datasets used. This is attributed to the fact that Gaussian Naive Bayes takes the mean and standard deviation
into account in the probability calculation. The results also show that Gaussian Naive Bayes is well suited
to sentiment analysis. The highest accuracy is obtained for AstraZeneca, at 98.9%, which indicates that the
tweets are indeed assigned to the positive, neutral, or negative category in line with the polarity and
subjectivity computed in this study. The accuracy of the proposed preprocessing and Gaussian Naive Bayes
pipeline is on average about 4% higher than that of the preprocessing and logistic regression pipeline. This
indicates the importance of a preprocessing step before determining the sentiment category.
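The average accuracies and the gap between the two methods can be checked directly from the Table 5 values. The short Python sketch below only reproduces that arithmetic, and the variable names are illustrative.

```python
# Accuracy values (%) copied from Table 5, in the order
# AstraZeneca, Moderna, Pfizer, Sinopharm, Sinovac.
gnb = [98.9, 97.8, 97.6, 97.5, 95.6]
logreg = [91.6, 95.8, 94.2, 94.0, 91.9]

avg_gnb = sum(gnb) / len(gnb)           # mean accuracy of Gaussian Naive Bayes
avg_logreg = sum(logreg) / len(logreg)  # mean accuracy of logistic regression
gap = avg_gnb - avg_logreg              # difference in percentage points

print(f"Gaussian Naive Bayes average: {avg_gnb:.2f}%")
print(f"Logistic regression average:  {avg_logreg:.2f}%")
print(f"Average gap: {gap:.2f} percentage points")
```

This gives averages of 97.48% and 93.50%, a difference of 3.98 percentage points, which rounds to the 4% figure.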

The accuracy for each type of vaccine is also influenced by the average polarity and subjectivity.
The AstraZeneca vaccine has an average polarity of 0.09 and a subjectivity of 0.32, and it has the highest
accuracy of all the vaccine types. The Sinovac vaccine has an average polarity of 0.08 and a subjectivity of
0.27, and it has the lowest accuracy of all the vaccine types. These two vaccines have similar average
polarity values but differ by about 0.05 in subjectivity, and their accuracies differ by 3.3%. Moderna and
Pfizer have the smallest average polarity values of all the vaccines, but their subjectivity values are about
0.3, and they achieve higher accuracy than the Sinovac and Sinopharm vaccines. The Sinopharm vaccine has
the highest average polarity but a low subjectivity, and its accuracy does not exceed that of the AstraZeneca,
Moderna, and Pfizer vaccines. Taken together, these observations suggest that subjectivity has more
influence on accuracy than polarity.


4. CONCLUSION 

The experimental results show that the proposed Gaussian Naive Bayes method outperforms the
logistic regression method it was compared with. The Gaussian Naive Bayes method also achieves a high
average accuracy across all of the vaccine datasets, namely 97.48%. The value of subjectivity has more
influence on the accuracy results than the value of polarity. These results indicate that the proposed method
is well suited to sentiment analysis problems. For further research, experiments can be carried out using
other methods, such as tree-based methods.